Cardiovascular disease (CVD) remains a leading cause of mortality worldwide, which makes accurate and timely risk prediction a clinical priority. Machine learning (ML) approaches have increasingly demonstrated their value in supporting clinical decision-making; however, challenges remain regarding model robustness, reproducibility, and real-world applicability. This study presents a comparative evaluation of several supervised ML classifiers, namely Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), and the Random Forest (RF) ensemble, for binary classification of heart disease using the widely adopted UCI Heart Disease dataset. Beyond a conventional offline evaluation, this work also integrates an interactive R Shiny interface, developed in RStudio, that lets users enter patient attributes directly and view the predicted outcome immediately. The models were validated over multiple trials to assess stability and generalization. Experimental findings indicate that the RF classifier achieves the highest and most stable predictive performance (mean accuracy of 92.00%), supporting its suitability as a candidate for clinical decision-support tools.
Introduction
Cardiovascular disease (CVD) remains the leading global cause of death, making early and accurate prediction critical. Traditional diagnostic methods often lack predictive power, which has led to increased use of machine learning (ML) for improved risk assessment and clinical decision support.
This study compares multiple supervised ML models using the UCI Heart Disease dataset: Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), Naïve Bayes, Learning Vector Quantization (LVQ), and Random Forest (RF). The goal is to identify the most reliable model for binary CVD prediction while ensuring reproducibility through multiple runs, cross-validation, and hyperparameter tuning. An interactive R Shiny interface is also developed so that users can obtain predictions in real time.
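The repeated cross-validation protocol described above can be sketched as follows. This is a minimal illustration in Python, not the R code used in the study; the function names, the majority-class stand-in classifier, the fold and repeat counts, and the toy labels are all assumptions made for the example.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k, seed):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def repeated_cv(features, labels, fit, predict, k=5, repeats=3):
    """Return one held-out accuracy per (repeat, fold) pair."""
    scores = []
    for seed in range(repeats):  # a different shuffle for every repeat
        for test_idx in stratified_folds(labels, k, seed):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            model = fit([features[i] for i in train_idx],
                        [labels[i] for i in train_idx])
            hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
            scores.append(hits / len(test_idx))
    return scores


# Hypothetical stand-in classifier: always predicts the majority training class.
def fit_majority(train_features, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, sample):
    return model


# Toy labels (assumption): 60 negative and 40 positive cases, dummy features.
toy_labels = [0] * 60 + [1] * 40
toy_features = [[0.0] for _ in toy_labels]

scores = repeated_cv(toy_features, toy_labels, fit_majority, predict_majority)
print(f"runs: {len(scores)}, mean accuracy: {mean(scores):.3f}, "
      f"std: {stdev(scores):.3f}")
```

Reporting the spread of scores alongside the mean is what allows the stability of one model to be compared with another; any real classifier can be substituted for the stand-in through the `fit` and `predict` arguments.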
Results show that Random Forest performs best overall, offering high accuracy and stability due to its ensemble nature. Logistic Regression also performs consistently and is useful for low-risk screening, while KNN and LVQ show higher variability because they are sensitive to parameter settings. Evaluation uses accuracy, precision, recall (sensitivity), specificity, and F1-score, with particular emphasis on sensitivity because a missed case of heart disease is clinically more costly than a false alarm.
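For reference, the evaluation metrics named above can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` and the example labels are illustrative assumptions and are not taken from the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute standard binary classification metrics from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up example: 4 patients with disease, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example one diseased patient is missed, so recall is 0.75 even though accuracy is 0.70; this is the kind of gap that motivates reporting sensitivity separately.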
The study also highlights key limitations in existing research, particularly inconsistent evaluation methods and overfitting on small datasets. Future work will focus on larger multi-source datasets, improved hyperparameter optimization, hybrid deep learning models, and better explainability for clinical use.
Conclusion
This study compared several supervised machine learning classifiers for the early prediction of cardiovascular disease using the UCI Heart Disease dataset. Across the evaluation, the ensemble-based approach proved more reliable than the single-model methods. Among the tested models, Random Forest achieved the most stable overall performance, with the highest mean accuracy (92.00%) and an F1-score reflecting balanced precision and recall. These results are consistent with the established understanding that ensemble techniques reduce model variance and improve generalization when dealing with complex and heterogeneous medical features.
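The variance-reduction argument can be made concrete with the textbook expression for the variance of an average of correlated predictors. The notation is introduced here for illustration and does not come from the study: B is the number of trees, T_b(x) is the prediction of tree b, each tree is assumed to have the same variance \sigma^2, and \rho is the assumed pairwise correlation between trees.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Under these assumptions, the second term shrinks as B grows, so the ensemble variance approaches \rho\sigma^2. The random feature selection used by Random Forest aims to lower \rho, which reduces this remaining term.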
Logistic Regression also performed notably well, achieving an average accuracy of 88.30% and demonstrating strong capability in identifying true positive cases, which is essential in clinical environments where undetected heart disease cases can result in life-threatening outcomes. Thus, despite its simplicity, Logistic Regression remains a viable and interpretable diagnostic tool that can complement more advanced ensemble methods.
Overall, the findings suggest that machine learning can serve as a valuable computational aid to healthcare professionals, providing early risk estimates that support proactive treatment decisions and may help reduce the likelihood of severe cardiac events. Future work may incorporate larger datasets, integrate real-time physiological monitoring data, and apply deep learning architectures to further improve diagnostic precision.